Digital Forensics

Project

Faked Speech Detection with Zero Knowledge

Submitted by Anand Prakash Saini

Original Paper: https://arxiv.org/abs/2209.12573, Authors: Sahar Al Ajmi, Khizar Hayat, Alaa M. Al Obaidi, Naresh Kumar, Munaf Najmuldeen (University of Nizwa, Sultanate of Oman)

Problem Statement:

The rapid advancement of artificial intelligence and digital audio processing technologies has made the creation of faked or mimicked speech not only possible but increasingly convincing. This development poses significant risks, ranging from personal identity theft to misinformation and security threats. In an era when digital communication predominates, the authenticity of audio content has become a critical concern.

Introduction:

This research work introduces a neural network-based method to distinguish real from mimicked speech without relying on any reference or real source audio, termed 'zero-knowledge' detection.

It leverages a range of sophisticated audio feature extraction techniques, including spectral and cepstral analyses, to train a model capable of identifying faked speech with remarkable accuracy.

The model's ability to operate in a blind mode, without the need for comparison against authentic audio, sets a new precedent in the field of digital audio forensics.

Audio Processing Background:

Audio vs. Speech: Audio is waveform data in which the amplitude changes with respect to time, whereas speech is oral communication: the act of speaking and expressing thoughts and emotions through sounds and gestures.

An audio signal is a representation of sound as a function of the vibrations audible to the human ear (20 Hz to 20 kHz). Machines process audio differently from humans: to capture a sound, a machine needs a recorder, and the audio must then be saved in a machine-processable format such as MP3, WMA, or WAV.
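As a minimal sketch of machine-processable audio, the following generates one second of a 440 Hz tone and round-trips it through a WAV file (the filename and tone parameters are arbitrary illustrative choices):

```python
import os
import tempfile

import numpy as np
from scipy.io import wavfile

# One second of a 440 Hz sine tone sampled at 16 kHz, stored as
# 16-bit PCM (the standard uncompressed WAV encoding).
sr = 16000
t = np.arange(sr) / sr
tone = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

path = os.path.join(tempfile.gettempdir(), "tone.wav")
wavfile.write(path, sr, tone)

# Reading the file back yields the sampling rate and the raw samples.
rate, samples = wavfile.read(path)
print(rate, samples.shape)
```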

Temporal and Spectral Features: Features from speech signals can be broadly classified as temporal or spectral. Temporal features are time-domain features; they have a simple physical interpretation and are easy to compute.

Spectral features, on the other hand, are frequency-based features extracted after transforming the time-domain signal into the frequency domain using the Fourier or a similar transform.

In the context of audio signals, such features can help identify pitch, notes, rhythm, melody, etc.
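To make the temporal/spectral distinction concrete, here is a sketch of one feature of each kind in plain NumPy: the zero-crossing rate (temporal) and the spectral centroid (spectral). The test signal is a pure 1 kHz tone, an assumption for illustration:

```python
import numpy as np

def zero_crossing_rate(x):
    # Temporal feature: fraction of consecutive sample pairs whose
    # signs differ.
    return np.mean(np.abs(np.diff(np.sign(x))) > 0)

def spectral_centroid(x, sr):
    # Spectral feature: magnitude-weighted mean frequency of the
    # signal's Fourier spectrum.
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return np.sum(freqs * mag) / np.sum(mag)

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)   # pure 1 kHz tone

zcr = zero_crossing_rate(tone)        # ~2 * 1000 / 16000 = 0.125
centroid = spectral_centroid(tone, sr)  # ~1000 Hz
print(zcr, centroid)
```

A 1 kHz tone crosses zero 2000 times per second, so at 16 kHz sampling its zero-crossing rate is about 0.125, and its centroid sits at the tone frequency.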

Spectrum and cepstrum are two important frequency-based concepts in audio processing: the spectrum is the Fourier transform of the signal, while the cepstrum is obtained by taking the inverse Fourier transform of the logarithm of the spectrum.
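The spectrum/cepstrum relationship can be sketched in a few lines of NumPy; this computes the real cepstrum as the inverse FFT of the log-magnitude spectrum (the random input is a stand-in for a speech frame):

```python
import numpy as np

def real_cepstrum(x):
    # Spectrum: Fourier transform of the signal.
    spectrum = np.fft.fft(x)
    # Cepstrum: inverse Fourier transform of the log spectrum
    # (a small epsilon guards against log(0)).
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    return np.fft.ifft(log_mag).real

x = np.random.default_rng(0).standard_normal(1024)
c = real_cepstrum(x)
print(c.shape)  # one cepstral coefficient per input sample
```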

The Dataset

The data was collected from a number of social media apps and sites; the downloaded audios were then edited to conform to the proposed model by limiting each clip to a maximum duration of 20 seconds in WAV format.
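The duration cap described above amounts to simple truncation of the sample array; a hypothetical sketch (the actual editing procedure is not detailed in the text):

```python
import numpy as np

MAX_SECONDS = 20  # maximum clip duration used for the dataset

def trim_to_max_duration(samples, sr, max_seconds=MAX_SECONDS):
    # Keep at most max_seconds worth of samples; shorter clips
    # are returned unchanged.
    return samples[: sr * max_seconds]

sr = 16000
long_clip = np.zeros(sr * 35)                 # a 35-second clip
trimmed = trim_to_max_duration(long_clip, sr)
print(len(trimmed) / sr)  # 20.0
```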

One part of the dataset consists of all English audios (both real and mimicked). The second part of the dataset contains a mix of both English and Arabic audios.

The audio files are named so that the first four characters are digits representing the index, and the fifth character is either 'r' or 'f', labelling the voice as real or faked, respectively. The raw dataset is publicly available on GitHub (https://github.com/SaSs7/Dataset).
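The naming convention can be parsed directly from the filename; a small sketch (the helper name and example filenames are illustrative):

```python
def parse_label(filename):
    # First four characters: zero-padded index; fifth character:
    # 'r' for real, 'f' for faked.
    index, flag = filename[:4], filename[4]
    assert index.isdigit() and flag in ("r", "f"), "unexpected filename"
    return int(index), "real" if flag == "r" else "faked"

print(parse_label("0001r.wav"))  # (1, 'real')
print(parse_label("0042f.wav"))  # (42, 'faked')
```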

Our goal is to blindly identify whether a given voice is mimicked or not. Hence, for our experiments, a set of independent real and faked audios was required to create the dataset: real and faked voices uttered independently of what is being said, who said it, and in which language.

Methodology

The use case of this method concerns the scenario of complete blindness, wherein no prior or side information is available about the speaker.

This work's emphasis is on classifying speech as faked or otherwise, under the assumption that no other recorded voice of the speaker, whether genuine or disguised, is available.

The purported speaker is represented only once in the training data, and then either as real or mimicked, but not both. Potentially, this work's method may be very useful in improving the efficiency of many audio processing methods, especially when applied at the pre-processing stage.

The Proposed Method

[Figure: block diagram of the proposed method]

Input

For the training set, it is essential that:

Keeping the above in view, we extracted our dataset from the raw dataset to contain 933 English-only audio samples and 1127 samples in both English and Arabic.

Feature extraction

The features are extracted with a gammatone filter bank, whose impulse response is $$ g(t) = a\, t^{\,n-1} e^{-2\pi b t} \cos(2\pi f_{c} t + \phi) $$ where $a$ is the peak value, $n$ the order of the filter, $b$ the bandwidth, $f_{c}$ the characteristic frequency, and $\phi$ the initial phase. $f_{c}$ and $b$ can be derived from the Equivalent Rectangular Bandwidth (ERB) scale, using the following equations: $$ \text{ERB}(f_{c}) = 24.7 \cdot \left(4.37 \cdot \frac{f_{c}}{1000} + 1\right) $$ $$ b = 1.019 \times \text{ERB}(f_{c}) $$
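The ERB relations are straightforward to evaluate; for example, at $f_{c} = 1\,\text{kHz}$ the ERB is $24.7 \cdot 5.37 \approx 132.64\,\text{Hz}$:

```python
def erb(f_c):
    # Equivalent Rectangular Bandwidth at characteristic frequency f_c (Hz).
    return 24.7 * (4.37 * f_c / 1000 + 1)

def bandwidth(f_c):
    # Gammatone filter bandwidth b derived from the ERB scale.
    return 1.019 * erb(f_c)

print(erb(1000.0))        # 24.7 * 5.37 = 132.639
print(bandwidth(1000.0))  # 1.019 * 132.639
```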

Preprocessing:

The data is standardized, which means that they will have a mean of 0 and a standard deviation of 1.
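This standardization step can be sketched with scikit-learn's `StandardScaler`; the feature matrix below is random placeholder data, not the actual extracted features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrix: 200 samples x 13 coefficients, drawn
# with a nonzero mean and non-unit spread on purpose.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 13))

# Rescale each feature column to mean 0 and standard deviation 1.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(np.allclose(X_std.mean(axis=0), 0.0, atol=1e-9))  # True
print(np.allclose(X_std.std(axis=0), 1.0, atol=1e-9))   # True
```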

The Neural Network:

Using sklearn.model_selection, the feature set is first partitioned into training and testing sets. During the training phase, the training set is dynamically partitioned into training and validation parts.
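The split described above can be sketched as follows; the 80/20 ratio and the random seed are illustrative assumptions, and the data is a random stand-in for the feature set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature set: 1000 samples x 13 features, with binary
# labels (0 = real, 1 = faked).
X = np.random.default_rng(0).normal(size=(1000, 13))
y = np.random.default_rng(1).integers(0, 2, size=1000)

# Held-out test set; stratify keeps the real/faked ratio consistent.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test))  # 800 200
```

A further validation part is then carved out of `X_train` during training (e.g. via a `validation_split` argument to the fitting routine).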

[Figure: the neural network]

All English Audios:

Evaluation of the English-only model on Arabic-only audios

Mixed - English and Arabic

Further Experiment

CNN Model:

CNNs are highly effective in capturing local patterns in data. In the context of audio feature analysis, they can efficiently recognize patterns in spectral features and MFCCs, which are critical for distinguishing real from faked speech.
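As an illustration of this idea, here is a minimal 1-D CNN over a sequence of MFCC frames in Keras; the layer sizes and the 40-frames-by-13-coefficients input shape are assumptions for the sketch, not the exact architecture used here:

```python
import numpy as np
from tensorflow.keras import layers, models

# Minimal 1-D CNN: convolutions scan along the time axis of the
# MFCC sequence to pick up local spectral patterns.
model = models.Sequential([
    layers.Input(shape=(40, 13)),           # 40 frames x 13 MFCCs
    layers.Conv1D(16, kernel_size=3, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(32, kernel_size=3, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),  # P(faked)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# One dummy input yields one probability.
pred = model.predict(np.zeros((1, 40, 13)), verbose=0)
print(pred.shape)  # (1, 1)
```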

To Avoid Overfitting:
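Common safeguards against overfitting in this setting are dropout and early stopping on the validation loss; the following is a hypothetical Keras sketch of both (the layer sizes, dropout rate, and patience are illustrative, not necessarily the exact choices used here):

```python
from tensorflow.keras import callbacks, layers, models

# Dropout randomly zeroes a fraction of activations during training,
# discouraging co-adaptation of units.
model = models.Sequential([
    layers.Input(shape=(13,)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),                    # drop 30% of units
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Early stopping halts training once the validation loss stops
# improving and restores the best weights seen so far.
early_stop = callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
# Would be passed as: model.fit(..., validation_split=0.2,
#                               callbacks=[early_stop])
print(early_stop.patience)  # 5
```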

Conclusion